feat: make async in-flight task cap configurable by eric-tramel · Pull Request #699 · NVIDIA-NeMo/DataDesigner

eric-tramel · 2026-05-21T19:04:49Z

📋 Summary

Adds a public RunConfig.max_in_flight_tasks setting for the async scheduler's task-lease capacity and raises the default from 256 to 1024. The configured value is now the singular scheduler work cap: model-stage task admission and adaptive row-group queue backpressure follow it, while model API request concurrency remains controlled separately by max_parallel_requests and request admission.

🔗 Related Issue

N/A

🔄 Changes

Add RunConfig.max_in_flight_tasks with a default of 1024 and validation that it is at least 1.
Wire the async dataset builder to pass the configured in-flight task cap into AsyncTaskScheduler.
Set scheduler model-stage task admission from max_in_flight_tasks directly, without the old default floor or aggregate request headroom.
Collapse the duplicate scheduler default constant so the engine uses DEFAULT_IN_FLIGHT_TASK_CAPACITY for this concept.
Make adaptive row-group queue backpressure derive from max_in_flight_tasks, rather than an inflated model-stage admission value.
Update capacity reporting to mark submission capacity as sourced from run_config.
Add regression coverage for config validation, DataDesigner run config persistence, builder propagation, lowered model-stage admission, and row-group queue guard behavior.

🧪 Testing

make test passes (not run; focused tests below)
Unit tests added/updated
E2E tests added/updated (N/A — no E2E surface)

Focused checks run after rebasing onto latest origin/main:

uv run --group dev pytest packages/data-designer-config/tests/config/test_run_config.py packages/data-designer-engine/tests/engine/dataset_builders/test_async_scheduler.py packages/data-designer-engine/tests/engine/dataset_builders/test_async_builder_integration.py packages/data-designer-engine/tests/engine/dataset_builders/test_dataset_builder.py::test_run_config_default_non_inference_max_parallel_workers packages/data-designer/tests/interface/test_data_designer.py::test_run_config_setting_persists (110 passed)
uv run ruff format --check ... on touched files
uv run ruff check --output-format=full ... on touched files
git diff --check
Package-source search confirms no legacy submitted-task knob references or duplicate scheduler cap constants

✅ Checklist

Follows commit message conventions
Commits are signed off (DCO)
Architecture docs updated (N/A — runtime config surface only)

github-actions · 2026-05-21T19:07:14Z

Review: PR #699 — feat: make async task cap configurable

Summary

Promotes the previously hard-coded async scheduler submission cap to a public RunConfig.max_submitted_tasks field (default 1024, validated >= 1) and raises the default from 256 to 1024. The async dataset builder now reads the value from run_config and forwards it to AsyncTaskScheduler; max_model_task_admission is computed as max(DEFAULT_TASK_POOL_SIZE, max_submitted_tasks, 2 * aggregate_max_parallel_requests) so model-heavy runs are not capped by the old 256 floor. Capacity reporting now sources submission_capacity from "run_config" rather than "dataset_builder". Total: +48/-8 lines across 7 files.

Findings

Correctness — looks right

DEFAULT_TASK_POOL_SIZE in async_scheduler.py:80 is now sourced from DEFAULT_SUBMISSION_CAPACITY in task_admission.py:39, giving the two constants a single source of truth. Both default to 1024, matching the new RunConfig default — invariant holds.
The max_model_task_admission formula at dataset_builder.py:1028-1032 preserves the existing floor semantics: model task admission is at least DEFAULT_TASK_POOL_SIZE even if the user sets max_submitted_tasks below 1024. This matches the PR description's intent.
Field validation (ge=1) is consistent with neighboring fields like non_inference_max_parallel_workers.
submission_capacity source string changed to "run_config" accurately reflects the new wiring; downstream consumers of AsyncCapacityPlan are unlikely to depend on the literal string, but worth verifying if any dashboards parse it.

Behavior change worth calling out

The default jumps from 256 → 1024 (4×). Users who never set this knob will see significantly more concurrent task leases. The PR notes this as intended (so model-heavy runs don't get throttled by the old floor), but consider:

Memory pressure for users with many columns / large row groups: each in-flight task carries scheduler bookkeeping plus any per-task buffers.
Existing performance baselines or capacity-plan snapshots in tests/recipes may shift; the PR updates the tests that assert submission_capacity directly, but external consumers (logs, telemetry dashboards, runbooks) could be affected.

Not blocking, but a release note / changelog entry beyond the PR body would help downstream users notice.

Naming / minor consistency

DEFAULT_TASK_POOL_SIZE and DEFAULT_SUBMISSION_CAPACITY are now aliases. Keeping both is fine for layering reasons (the engine module re-exports the config-layer constant), but a one-line comment at async_scheduler.py:80 explaining "alias retained for legacy import surface" would prevent a future refactor from quietly dropping it.
The docstring in run_config.py:136-137 reads "Maximum number of scheduler task leases, and therefore async scheduler worker tasks…" — the "and therefore" is a bit awkward. Consider: "Maximum number of in-flight scheduler task leases (i.e., the cap on concurrent async scheduler worker tasks)."

Test coverage — solid

test_run_config.py: default value, custom value, validation rejection — all three cases covered.
test_async_builder_integration.py: confirms the value flows through _setup_async_pipeline to AsyncTaskScheduler.__init__ (1536 → both max_submitted_tasks and max_model_task_admission, since 1536 > 2 * aggregate in the test).
test_data_designer.py::test_run_config_setting_persists: round-trip through DataDesigner.set_run_config.
test_dataset_builder.py: default assertion alongside other run-config defaults.

One gap: there's no test that exercises the floor behavior — i.e., a user setting max_submitted_tasks below DEFAULT_TASK_POOL_SIZE (say 64) and confirming max_model_task_admission stays at 1024. The integration test uses 1536, which exercises the upper path. A small unit test at the builder level locking in the floor would prevent a future refactor from silently dropping it.

Style / project conventions

Absolute imports preserved.
from __future__ import annotations present in changed source files.
Field uses Field(default=…, ge=…, description=…) matching the surrounding style.
No reverse imports introduced; the new constant lives in the engine package and is consumed there + read out of run_config (config → engine direction is preserved).

Security / performance

No security-sensitive surface touched.
Performance: the 4× default raise is the main concern (covered above). Validation is bounded only on the lower side (ge=1); there is no upper bound. A user could set max_submitted_tasks=1_000_000 and OOM. This matches existing fields like buffer_size (also unbounded above), so it's consistent with the project's posture, but a soft sanity ceiling or warning log on unusually large values would be a small defensive improvement.

Verdict

Approve-with-nits. The wiring is clean, tests cover the happy path and validation, and the layering respects the config → engine direction. Suggested non-blocking follow-ups:

Add a unit test that pins the max_model_task_admission floor when max_submitted_tasks is set below DEFAULT_TASK_POOL_SIZE.
Tighten the max_submitted_tasks docstring and add a one-line comment on the DEFAULT_TASK_POOL_SIZE/DEFAULT_SUBMISSION_CAPACITY alias.
Consider a changelog/release-note line flagging the default bump from 256 → 1024 for users with capacity-sensitive deployments.

greptile-apps · 2026-05-21T19:10:14Z

Greptile Summary

This PR exposes RunConfig.max_in_flight_tasks as a user-facing configuration for the async scheduler's task-lease capacity, raises the default from 256 to 1024, and simplifies internal plumbing by removing the old aggregate-request-headroom multiplier and duplicate constant.

RunConfig gains a max_in_flight_tasks: int = Field(default=1024, ge=1) field; dataset_builder.py reads it directly and passes it as both max_in_flight_tasks and max_model_task_admission to AsyncTaskScheduler, replacing the prior max(DEFAULT_TASK_POOL_SIZE, 2 × aggregate_max_parallel_requests) computation.
DEFAULT_IN_FLIGHT_TASK_CAPACITY = 1024 is consolidated in task_admission.py and imported into async_scheduler.py; DEFAULT_TASK_POOL_SIZE and MODEL_TASK_ADMISSION_HEADROOM_MULTIPLIER are deleted.
The adaptive row-group queue-guard formula is simplified from max(max_submitted_tasks × 4, max_model_task_admission × 2) to max_in_flight_tasks × 4, and the "local" key is removed from the TaskAdmissionConfig resource-limits map (it was never requested by any task type, so this is a no-op for admission control).

Confidence Score: 5/5

Safe to merge — all changes are additive config plumbing with no side-effects on existing behavior for users who do not set the new field.

The new max_in_flight_tasks field is wired end-to-end from RunConfig through DatasetBuilder into AsyncTaskScheduler consistently, the old aggregate-headroom path is fully removed, and the 'local' key deleted from resource_limits was never requested by any task type so its removal is a no-op. The default rises from 256 to 1024, which increases throughput headroom but does not change correctness. Test coverage spans config validation, round-trip persistence, builder propagation, model-stage admission limits, and the new queue-guard formula.

No files require special attention.

Important Files Changed

Filename	Overview
packages/data-designer-config/src/data_designer/config/run_config.py	Adds max_in_flight_tasks: int = Field(default=1024, ge=1) with docstring and Pydantic validation; straightforward field addition.
packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py	Renames max_submitted_tasks → max_in_flight_tasks, removes DEFAULT_TASK_POOL_SIZE/MODEL_TASK_ADMISSION_HEADROOM_MULTIPLIER, simplifies queue guard to max_in_flight_tasks * 4, and removes unused 'local' key from resource_limits.
packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py	Replaces the aggregate model-request-headroom multiplier with a direct read of run_config.max_in_flight_tasks; sets both max_in_flight_tasks and max_model_task_admission to the same value.
packages/data-designer-engine/src/data_designer/engine/dataset_builders/scheduling/task_admission.py	Extracts DEFAULT_IN_FLIGHT_TASK_CAPACITY = 1024 constant and updates TaskAdmissionConfig.submission_capacity default; no logic changes.
packages/data-designer-engine/tests/engine/dataset_builders/test_async_scheduler.py	Renames all max_submitted_tasks usages to max_in_flight_tasks and adds test_scheduler_adaptive_row_group_queue_guard_uses_in_flight_task_cap covering the new guard formula.
packages/data-designer-engine/tests/engine/dataset_builders/test_async_builder_integration.py	Updates integration test to assert max_in_flight_tasks and max_model_task_admission are both propagated from run_config, and guards against any future accidental call to get_aggregate_max_parallel_requests.
packages/data-designer/tests/interface/test_data_designer.py	Extends test_run_config_setting_persists to verify max_in_flight_tasks round-trips through all three set-config calls.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    RC[RunConfig\nmax_in_flight_tasks=1024] -->|read| DB[DatasetBuilder\n_prepare_async_run]
    DB -->|max_in_flight_tasks\nmax_model_task_admission| ATS[AsyncTaskScheduler.__init__]
    ATS -->|submission_capacity=max_in_flight_tasks\nresource_limits llm_wait=max_model_task_admission| TAC[TaskAdmissionConfig]
    ATS -->|model_group_limit_cap=max_model_task_admission| TSR[TaskSchedulingResolver]
    ATS -->|_max_in_flight_tasks| QG[Queue Guard\nmax_in_flight_tasks × 4]
    TAC --> CTRL[TaskAdmissionController\nleases submission + llm_wait]
    TSR -->|admitted_limit per model group| CTRL
    QG -->|blocks new row-group admission| RGA[Adaptive Row-Group Admission]

_{Reviews (5): Last reviewed commit: "feat: make async in-flight task cap conf..." | Re-trigger Greptile}

nabinchha

Thanks for putting this together, @eric-tramel — clean, well-scoped change.

Summary

Adds RunConfig.max_in_flight_tasks (default 1024, ge=1) so users can tune the async scheduler's task-lease capacity, raises the default from 256 to 1024, and wires the value end-to-end through DatasetBuilder into AsyncTaskScheduler (including the model-stage admission floor and capacity reporting). The implementation matches the stated intent in the PR description and keeps the engine/config layering intact — config grows a new field, engine consumes it.

Findings

Warnings — Worth addressing

packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py:1064-1068 — max_model_task_admission floor can surprise users tuning down

What: When a user explicitly lowers max_in_flight_tasks (e.g. 64), the floor still pins max_model_task_admission to max(DEFAULT_TASK_POOL_SIZE=1024, 64, 2 * aggregate) = 1024. So the llm_wait resource keeps a 1024-slot ceiling even though the user asked for a much smaller cap.
Why: This is consistent with the pre-PR behavior (the floor was 256 before), so it isn't a regression — but combined with the new public knob, it could read as "I set the cap to 64, why is the LLM stage still letting through 1024?" The local and submission resources do drop with the user's value; only llm_wait keeps the floor. The new docstring on max_in_flight_tasks doesn't hint at that asymmetry.
Suggestion: Either (a) drop the DEFAULT_TASK_POOL_SIZE term from the max(...) (use max(max_in_flight_tasks, 2 * aggregate)) so the user's setting always governs, or (b) add a one-line note to the max_in_flight_tasks docstring that the model-stage admission floor stays at the engine default. (a) is cleaner if there's no specific reason to keep the floor when the user has opted in to a smaller cap.

Suggestions — Take it or leave it

packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py:80 — Two names for the same constant

What: DEFAULT_TASK_POOL_SIZE is now defined as DEFAULT_IN_FLIGHT_TASK_CAPACITY. Both names refer to the same 1024 value but live in two modules.
Why: It's preserved as an alias (likely for downstream callers and tests that import it), but two names for one concept costs a bit of reading time when navigating the scheduler.
Suggestion: As a follow-up, consider collapsing to a single name (DEFAULT_IN_FLIGHT_TASK_CAPACITY reads well in context) once you're comfortable touching the in-tree import sites. Not blocking — and worth doing separately so this PR stays focused.

packages/data-designer-engine/tests/engine/dataset_builders/test_async_builder_integration.py:194 — Test name vs. what it now asserts

What: test_prepare_async_run_enables_request_pressure_advisory now also asserts max_in_flight_tasks / max_model_task_admission propagation.
Why: Searching for "what tests cover the new knob" won't surface this test by name. Both assertions are valid checks of the same code path, but the test name only describes the first.
Suggestion: Either split into a focused test_prepare_async_run_propagates_max_in_flight_tasks or rename to something like test_prepare_async_run_propagates_scheduler_kwargs. Tiny ergonomics nit.

packages/data-designer-config/src/data_designer/config/run_config.py:175-182 — Field(description=...) duplicates the Attributes: docstring

What: The class docstring already has a full max_in_flight_tasks: entry, and Field(description=...) repeats a shorter version of the same text.
Why: Other simple numeric fields in RunConfig (e.g. buffer_size, non_inference_max_parallel_workers) don't carry a description=. The redundancy is harmless but slightly inconsistent with the surrounding pattern.
Suggestion: Either drop the description= (keep just Field(default=1024, ge=1)) or, if you'd rather keep schema-level descriptions, leave it as-is. Pure style call.

What Looks Good

The new knob's docstring is the standout — it explicitly disambiguates max_in_flight_tasks from max_parallel_requests, which is exactly the confusion this change could otherwise introduce.
Coverage is layered well: config-level validation (test_run_config_*), interface persistence (test_run_config_setting_persists), and builder propagation (the integration test). Each layer asserts the right thing without overlap.
The capacity-source flip from dataset_builder → run_config on submission_capacity is a small but accurate provenance change, and it's still type-safe under CapacityValueSource.
Renaming max_submitted_tasks → max_in_flight_tasks is a real readability win — "submission" was just one phase, and the new name captures what the cap actually governs.

Verdict

Ship it (with nits) — the one Warning is non-blocking (a behavior nuance worth either tweaking or documenting), and the rest are pure style. Nothing critical here.

johnnygreco

@eric-tramel I rechecked the review feedback against the current PR head. The substantive/actionable items look addressed: max_model_task_admission now follows max_in_flight_tasks directly, the old aggregate/floor behavior is gone, and the focused verification passed on my side. The remaining notes are optional nits/follow-ups, not blockers. Approving.

Signed-off-by: Eric W. Tramel <eric.tramel@gmail.com>

johnnygreco

Re-checked after the branch update / conflict resolution. The scheduler task-cap behavior still matches the previously reviewed version: max_model_task_admission follows max_in_flight_tasks directly, and the focused propagation coverage is still in place. Re-approving.

eric-tramel requested a review from a team as a code owner May 21, 2026 19:04

eric-tramel temporarily deployed to agentic-ci May 21, 2026 19:05 — with GitHub Actions Inactive

eric-tramel force-pushed the codex/configurable-scheduler-task-cap branch from 1b75986 to 5981281 Compare May 21, 2026 19:17

eric-tramel changed the title ~~feat: make async task cap configurable~~ feat: make async in-flight task cap configurable May 21, 2026

eric-tramel force-pushed the codex/configurable-scheduler-task-cap branch 2 times, most recently from 1bcb049 to 5ae93df Compare May 21, 2026 19:21

nabinchha previously approved these changes May 21, 2026

View reviewed changes

eric-tramel self-assigned this May 22, 2026

eric-tramel dismissed nabinchha’s stale review via eccce53 May 22, 2026 01:46

eric-tramel force-pushed the codex/configurable-scheduler-task-cap branch from 9c7d4d2 to eccce53 Compare May 22, 2026 01:46

eric-tramel requested a review from nabinchha May 22, 2026 02:01

johnnygreco previously approved these changes May 22, 2026

View reviewed changes

feat: make async in-flight task cap configurable

a0fa27c

Signed-off-by: Eric W. Tramel <eric.tramel@gmail.com>

eric-tramel dismissed johnnygreco’s stale review via a0fa27c May 22, 2026 14:31

eric-tramel force-pushed the codex/configurable-scheduler-task-cap branch from eccce53 to a0fa27c Compare May 22, 2026 14:31

eric-tramel requested a review from johnnygreco May 22, 2026 14:33

johnnygreco approved these changes May 22, 2026

View reviewed changes

eric-tramel merged commit c911e2c into main May 22, 2026
54 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: make async in-flight task cap configurable#699

feat: make async in-flight task cap configurable#699
eric-tramel merged 1 commit into
mainfrom
codex/configurable-scheduler-task-cap

eric-tramel commented May 21, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 21, 2026

Uh oh!

greptile-apps Bot commented May 21, 2026 •

edited

Loading

Confidence Score: 5/5

Flowchart

Uh oh!

nabinchha left a comment

Uh oh!

johnnygreco left a comment

Uh oh!

johnnygreco left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

eric-tramel commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

📋 Summary

🔗 Related Issue

🔄 Changes

🧪 Testing

✅ Checklist

Uh oh!

github-actions Bot commented May 21, 2026

Review: PR #699 — feat: make async task cap configurable

Summary

Findings

Correctness — looks right

Behavior change worth calling out

Naming / minor consistency

Test coverage — solid

Style / project conventions

Security / performance

Verdict

Uh oh!

greptile-apps Bot commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

nabinchha left a comment

Choose a reason for hiding this comment

Summary

Findings

Warnings — Worth addressing

Suggestions — Take it or leave it

What Looks Good

Verdict

Uh oh!

johnnygreco left a comment

Choose a reason for hiding this comment

Uh oh!

johnnygreco left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

eric-tramel commented May 21, 2026 •

edited

Loading

greptile-apps Bot commented May 21, 2026 •

edited

Loading